Foreword 

This Technical Specification has been produced by the 3GPP. 

This document specifies the Voice Activity Detector (VAD) to be used in the Discontinuous Transmission 
(DTX) as described in [3]. 

The contents of the present document are subject to continuing work within the TSG and may change 
following formal TSG approval. Should the TSG modify the contents of this TS, it will be re-released by the 
TSG with an identifying change of release date and an increase in version number as follows: 

Version x.y.z 

where: 

x the first digit: 

1 presented to TSG for information; 

2 presented to TSG for approval; 

3 Indicates TSG approved document under change control. 

y the second digit is incremented for all changes of substance, i.e. technical enhancements, 
corrections, updates, etc. 

z the third digit is incremented when editorial only changes have been incorporated in the 
specification; 

1 Scope 



This document specifies the Voice Activity Detector (VAD) to be used in the Discontinuous Transmission 
(DTX) as described in [3]. 

The requirements are mandatory for any VAD to be used either in User Equipment (UE) or in Base Station 
Systems (BSSs) that utilize the AMR wideband speech codec. 



2 Normative References 

This TS incorporates by dated and undated reference, provisions from other publications. These normative 
references are cited in the appropriate places in the text and the publications are listed hereafter. For dated 
references, subsequent amendments to or revisions of any of these publications apply to this TS only when 
incorporated in it by amendment or revision. For undated references, the latest edition of the publication 
referred to applies. 

[1] 3GPP TS 26.173: "ANSI-C code for the Adaptive Multi-Rate Wideband speech codec". 

[2] 3GPP TS 26.190: "AMR Wideband Speech Codec; Speech Transcoding Functions". 

[3] 3GPP TS 26.193: "AMR Wideband Speech codec; Source Controlled Rate Operation". 

[4] ITU, The International Telecommunications Union, Blue Book, Vol. III, Telephone 
Transmission Quality, IXth Plenary Assembly, Melbourne, 14-25 November, 1988, 
Recommendation G.711, Pulse code modulation (PCM) of voice frequencies. 



3 Technical Description 

3.1 Definitions, symbols and abbreviations 

3.1.1 Definitions 

For the purposes of this TS, the following definitions apply: 

frame: Time interval of 20 ms corresponding to the time segmentation of the speech 
transcoder. 

3.1.2 Symbols 

For the purposes of this TS, the following symbols apply. 

3.1.2.1 Variables 

bckr_est[n] background noise estimate at the frequency band "n" 

burst_count counts length of a speech burst, used by VAD hangover addition 

hang_count hangover counter, used by VAD hangover addition 

level[n] signal level at the frequency band "n" 

new_speech pointer of the speech encoder, points to a buffer containing the last received samples of a 
speech frame [2] 

noise_level estimated noise level 

pow_sum input power 

s(i) samples of the input frame 

snr_sum measure between input frame and noise estimate 

speech_level estimated speech level 

stat_count stationarity counter 

stat_rat measure indicating stationarity of the input frame 

tone_flag flag indicating the presence of a tone 

vad_thr VAD threshold 

VAD_flag Boolean VAD flag 

vadreg intermediate VAD decision 



3.1.2.2 Constants 

ALPHA_UP1            constant for updating noise estimate (see subclause 3.3.3.2) 
ALPHA_DOWN1          constant for updating noise estimate (see subclause 3.3.3.2) 
ALPHA_UP2            constant for updating noise estimate (see subclause 3.3.3.2) 
ALPHA_DOWN2          constant for updating noise estimate (see subclause 3.3.3.2) 
ALPHA3               constant for updating noise estimate (see subclause 3.3.3.2) 
ALPHA4               constant for updating average signal level (see subclause 3.3.3.2) 
ALPHA5               constant for updating average signal level (see subclause 3.3.3.2) 
BURST_HIGH           constant for controlling VAD hangover addition (see subclause 3.3.3.1) 
BURST_P1             constant for controlling VAD hangover addition (see subclause 3.3.3.1) 
BURST_SLOPE          constant for controlling VAD hangover addition (see subclause 3.3.3.1) 
COEFF3               coefficient for the filter bank (see subclause 3.3.1) 
COEFF5_1             coefficient for the filter bank (see subclause 3.3.1) 
COEFF5_2             coefficient for the filter bank (see subclause 3.3.1) 
HANG_HIGH            constant for controlling VAD hangover addition (see subclause 3.3.3.1) 
HANG_LOW             constant for controlling VAD hangover addition (see subclause 3.3.3.1) 
HANG_P1              constant for controlling VAD hangover addition (see subclause 3.3.3.1) 
HANG_SLOPE           constant for controlling VAD hangover addition (see subclause 3.3.3.1) 
FRAME_LEN            size of a speech frame, 256 samples (20 ms) 
MIN_SPEECH_LEVEL1    constant for speech estimation (see subclause 3.3.3.3) 
MIN_SPEECH_LEVEL2    constant for speech estimation (see subclause 3.3.3.3) 
MIN_SPEECH_SNR       constant for VAD threshold adaptation (see subclause 3.3.3) 
NO_P1                constant for VAD threshold adaptation (see subclause 3.3.3) 
NO_SLOPE             constant for VAD threshold adaptation (see subclause 3.3.3) 
NOISE_MAX            maximum value for noise estimate (see subclause 3.3.3.2) 
NOISE_MIN            minimum value for noise estimate (see subclause 3.3.3.2) 
POW_TONE_THR         threshold for tone detection (see subclause 3.3.3) 
SP_ACTIVITY_COUNT    constant for speech estimation (see subclause 3.3.3.3) 
SP_ALPHA_DOWN        constant for speech estimation (see subclause 3.3.3.3) 
SP_ALPHA_UP          constant for speech estimation (see subclause 3.3.3.3) 
SP_CH_MAX            constant for VAD threshold adaptation (see subclause 3.3.3) 
SP_CH_MIN            constant for VAD threshold adaptation (see subclause 3.3.3) 
SP_EST_COUNT         constant for speech estimation (see subclause 3.3.3.3) 
SP_P1                constant for VAD threshold adaptation (see subclause 3.3.3) 
SP_SLOPE             constant for VAD threshold adaptation (see subclause 3.3.3) 
STAT_COUNT           threshold for stationarity detection (see subclause 3.3.3.2) 
STAT_THR             threshold for stationarity detection (see subclause 3.3.3.2) 
STAT_THR_LEVEL       threshold for stationarity detection (see subclause 3.3.3.2) 
THR_HIGH             constant for VAD threshold adaptation (see subclause 3.3.3) 
TONE_THR             threshold for tone detection (see subclause 3.3.2) 
VAD_POW_LOW          constant for controlling VAD hangover addition (see subclause 3.3.3.1) 

3.1.2.3 Functions 

+      Addition 
-      Subtraction 
*      Multiplication 
/      Division 
|x|    absolute value of x 
AND    Boolean AND 
OR     Boolean OR 

$$\sum_{n=a}^{b} x(n) = x(a) + x(a+1) + \dots + x(b-1) + x(b)$$

$$\mathrm{MIN}(x,y) = \begin{cases} x, & x \le y \\ y, & y < x \end{cases}$$

$$\mathrm{MAX}(x,y) = \begin{cases} x, & x \ge y \\ y, & y > x \end{cases}$$

3.1.3 Abbreviations 

ANSI    American National Standards Institute 
DTX     Discontinuous Transmission 
VAD     Voice Activity Detector 
CNG     Comfort Noise Generation 



3.2 General 



The function of the VAD algorithm is to indicate whether each 20 ms frame contains signals that should be 
transmitted, e.g. speech, music or information tones. The output of the VAD algorithm is a Boolean flag 
(VAD_flag) indicating presence of such signals. 



3.3 Functional description 



The block diagram of the VAD algorithm is depicted in Figure 1. The VAD algorithm uses parameters of the 
speech encoder to compute the Boolean VAD flag (VAD_flag). The input frame for the VAD is sampled at 
12.8 kHz and thus contains 256 samples per 20 ms frame. The samples of the input frame (s(i)) are divided 
into sub-bands and the level of the signal (level[n]) in each band is calculated. The inputs to the tone 
detection function are the normalized open-loop pitch gains, which are calculated by the open-loop pitch 
analysis of the speech encoder. The tone detection function computes a flag (tone_flag) which indicates 
the presence of a signalling tone, voiced speech, or other strongly periodic signal. The background noise 
level (bckr_est[n]) is estimated in each band based on the VAD decision, signal stationarity and the tone 
flag. The intermediate VAD decision is calculated by comparing the input SNR (level[n]/bckr_est[n]) to an 
adaptive threshold. The threshold is adapted based on noise and long term speech estimates. Finally, the 
VAD flag is calculated by adding hangover to the intermediate VAD decision. 



[Block diagram. Inputs: s(i), ol_gain. Blocks: "Filter bank and computation of sub-band levels" (output 
level[n]), "Tone detection" (output tone_flag), "VAD decision" (output VAD_flag).] 

Figure 1. Simplified block diagram of the VAD algorithm 

3.3.1 Filter bank and computation of sub-band levels 

The input signal is divided into frequency bands using a 12-band filter bank (Figure 2). Cut-off frequencies for 
the filter bank are shown in Table 1. 

Table 1. Cut-off frequencies for the filter bank 

Band number    Frequencies 
1              0 - 200 Hz 
2              200 - 400 Hz 
3              400 - 600 Hz 
4              600 - 800 Hz 
5              800 - 1200 Hz 
6              1200 - 1600 Hz 
7              1600 - 2000 Hz 
8              2000 - 2400 Hz 
9              2400 - 3200 Hz 
10             3200 - 4000 Hz 
11             4000 - 4800 Hz 
12             4800 - 6400 Hz 



The input to the filter bank is a speech frame pointed to by the new_speech pointer of the speech encoder [1]. 
Input values for the filter bank are scaled down by one bit. This ensures safe scaling, i.e. saturation cannot 
occur during calculation of the filter bank. 



[Filter bank diagram. A cascade of 5th order and 3rd order filter blocks splits the input into the 12 bands 
of Table 1, from 0.0 - 0.2 kHz up to 4.8 - 6.4 kHz.] 

Figure 2. Filter bank 

The filter bank consists of 5th and 3rd order filter blocks. Each filter block divides the input into high-pass and 
low-pass parts and decimates the sampling frequency by 2. The 5th order filter block is calculated as follows: 

$$x_{lp}(i) = 0.5 \cdot \big(A_1(x(2i)) + A_2(x(2i+1))\big) \tag{1a}$$

$$x_{hp}(i) = 0.5 \cdot \big(A_1(x(2i)) - A_2(x(2i+1))\big) \tag{1b}$$

where 

x(i) input signal for a filter block 

x_lp(i) low-pass component 

x_hp(i) high-pass component 

The 3rd order filter block is calculated as follows: 

$$x_{lp}(i) = 0.5 \cdot \big(x(2i+1) + A_3(x(2i))\big) \tag{2a}$$

$$x_{hp}(i) = 0.5 \cdot \big(x(2i+1) - A_3(x(2i))\big) \tag{2b}$$

The filters A_1(), A_2() and A_3() are first order direct form all-pass filters, whose transfer function is given by: 

$$A(z) = \frac{C + z^{-1}}{1 + C \cdot z^{-1}} \tag{3}$$

where C is the filter coefficient. 

Coefficients for the all-pass filters A_1(), A_2() and A_3() are COEFF5_1, COEFF5_2, and COEFF3, 
respectively. 
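
As an illustration of equations (1) to (3), the following Python sketch implements a first order all-pass section and the two filter blocks. It is a floating-point sketch under stated assumptions, not the fixed-point reference code of [1]: the names AllPass, filter5 and filter3 are hypothetical, and the coefficient values used in any call are arbitrary placeholders rather than the values of COEFF5_1, COEFF5_2 and COEFF3.

```python
class AllPass:
    """First order all-pass section A(z) = (C + z^-1) / (1 + C*z^-1), equation (3)."""

    def __init__(self, c):
        self.c = c          # filter coefficient C (placeholder value supplied by caller)
        self.x_prev = 0.0   # x(n-1)
        self.y_prev = 0.0   # y(n-1)

    def step(self, x):
        # Difference equation of (3): y(n) = C*x(n) + x(n-1) - C*y(n-1)
        y = self.c * x + self.x_prev - self.c * self.y_prev
        self.x_prev, self.y_prev = x, y
        return y


def filter5(x, a1, a2):
    """5th order block, equations (1a) and (1b); output rate is half the input rate."""
    lp, hp = [], []
    for i in range(len(x) // 2):
        even = a1.step(x[2 * i])       # A_1(x(2i))
        odd = a2.step(x[2 * i + 1])    # A_2(x(2i+1))
        lp.append(0.5 * (even + odd))
        hp.append(0.5 * (even - odd))
    return lp, hp


def filter3(x, a3):
    """3rd order block, equations (2a) and (2b); output rate is half the input rate."""
    lp, hp = [], []
    for i in range(len(x) // 2):
        even = a3.step(x[2 * i])       # A_3(x(2i))
        lp.append(0.5 * (x[2 * i + 1] + even))
        hp.append(0.5 * (x[2 * i + 1] - even))
    return lp, hp
```

Because an all-pass section has unit gain at every frequency, a constant input settles entirely in the low-pass output of either block, which gives a quick sanity check of an implementation.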

Signal level is calculated at the output of the filter bank at each frequency band as follows: 



$$\mathrm{level}(n) = \sum_{i=\mathrm{START}_n}^{\mathrm{END}_n} |x_n(i)| \tag{4}$$

where: 

n index for the frequency band 

x_n(i) sample i at the output of the filter bank at frequency band n 

$$\mathrm{START}_n = \begin{cases} -6, & 1 \le n \le 4 \\ -12, & 5 \le n \le 8 \\ -24, & 9 \le n \le 11 \\ -48, & n = 12 \end{cases}$$

$$\mathrm{END}_n = \begin{cases} 7, & 1 \le n \le 4 \\ 15, & 5 \le n \le 8 \\ 31, & 9 \le n \le 11 \\ 63, & n = 12 \end{cases}$$

Negative indices of x_n(i) refer to the previous frame. 
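
The summation limits of equation (4) can be sketched as follows. This is an illustrative Python fragment, not the reference implementation: the names LEVEL_RANGES, summation_range and band_level are hypothetical, and each band's output is assumed to be held as two lists, one for the current frame and one for the previous frame.

```python
# (first band, last band, START, END) taken from equation (4)
LEVEL_RANGES = [
    (1, 4, -6, 7),
    (5, 8, -12, 15),
    (9, 11, -24, 31),
    (12, 12, -48, 63),
]


def summation_range(n):
    """Return (START_n, END_n) for band n in 1..12."""
    for first, last, start, end in LEVEL_RANGES:
        if first <= n <= last:
            return start, end
    raise ValueError("band index must be in 1..12")


def band_level(n, current, previous):
    """level(n): sum of |x_n(i)| over i = START_n..END_n.

    Negative i index the tail of the previous frame's output for this band.
    """
    start, end = summation_range(n)
    total = 0.0
    for i in range(start, end + 1):
        sample = current[i] if i >= 0 else previous[len(previous) + i]
        total += abs(sample)
    return total
```

For band 1, for example, the sum covers the last 6 samples of the previous frame and all 8 samples of the current frame.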

3.3.2 Tone detection 

The purpose of the tone detection function is to detect information tones, vowel sounds and other periodic 
signals. The tone detection uses normalized open-loop pitch gains (ol_gain), which are received from the 
speech encoder. If the pitch gain is higher than the constant TONE_THR, a tone is detected and the tone flag 
is set: 

if (ol_gain > TONE_THR) 
    tone_flag = 1 

The open-loop pitch search and, correspondingly, the tone flag are computed twice in each frame, except for 
mode 6.60 kbit/s, where they are computed only once. 
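
A minimal sketch of the tone flag computation is given below. The value of TONE_THR is a placeholder chosen for illustration, the function names are hypothetical, and keeping the recent flags in a bounded history (with an explicit 0 when no tone is detected) is an assumption made here so that later steps can inspect the last few flags.

```python
from collections import deque

TONE_THR = 0.65  # placeholder value, not taken from the specification


def detect_tone(ol_gain, tone_thr=TONE_THR):
    """Return 1 if the normalized open-loop pitch gain exceeds the threshold."""
    return 1 if ol_gain > tone_thr else 0


def update_tone_history(history, ol_gains):
    """Append one flag per open-loop pitch search performed in the frame."""
    for gain in ol_gains:
        history.append(detect_tone(gain))
    return history
```

A frame with two pitch searches contributes two flags to the history; a 6.60 kbit/s frame contributes one.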

3.3.3 VAD decision 

The block diagram of the VAD decision algorithm is shown in Figure 3. 

[Block diagram. Inputs: level[n], tone_flag. Blocks: "SNR Computation" (output snr_sum), "Background Noise 
Estimation" (output bckr_est[n]), "Speech Estimation" (output speech_level), "Threshold Adaptation" (inputs 
noise_level and speech_level, output vad_thr), "Comparison" (output vadreg), "Hangover Addition" (output 
VAD_flag).] 

Figure 3. Simplified block diagram of the VAD decision algorithm 

The power of the input frame is calculated as follows: 

$$\mathrm{frame\_pow} = \sum_{i=1}^{\mathrm{FRAME\_LEN}} s(i) \cdot s(i) \tag{5}$$

where the samples s(i) of the input frame are pointed to by the new_speech pointer of the speech encoder. 
The variable pow_sum is the sum of the powers of the current and previous frames. If pow_sum is lower than 
the constant POW_TONE_THR, tone_flag is set to zero. 
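
The power computation and the tone flag reset can be sketched as follows. The function names and the explicit passing of the previous frame's power are assumptions made for this illustration; the threshold is supplied by the caller because its value is not given in this text.

```python
def frame_power(samples):
    """Equation (5): sum of squared samples of one input frame."""
    return sum(s * s for s in samples)


def update_power(samples, prev_frame_pow, tone_flag, pow_tone_thr):
    """Return (frame_pow, pow_sum, tone_flag) for the current frame.

    pow_sum is the power of the current plus the previous frame; the tone
    flag is cleared when pow_sum falls below pow_tone_thr.
    """
    frame_pow = frame_power(samples)
    pow_sum = frame_pow + prev_frame_pow
    if pow_sum < pow_tone_thr:
        tone_flag = 0
    return frame_pow, pow_sum, tone_flag
```

The returned frame_pow is what the caller would pass as prev_frame_pow for the next frame.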

The difference between the signal levels of the input frame and the background noise estimate is calculated 
as follows: 

$$\mathrm{snr\_sum} = \sum_{n=1}^{12} \mathrm{MAX}\left(1.0, \frac{\mathrm{level}[n]}{\mathrm{bckr\_est}[n]}\right)^2 \tag{6}$$

where: 

level[n] signal level at band n 

bckr_est[n] level of background noise estimate at band n 
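
Equation (6) translates directly into a few lines of Python. The sketch below uses floating-point arithmetic and a hypothetical function name; it assumes every bckr_est[n] is strictly positive.

```python
def snr_sum(level, bckr_est):
    """Equation (6): sum over the bands of MAX(1.0, level/bckr_est) squared."""
    return sum(max(1.0, lev / est) ** 2 for lev, est in zip(level, bckr_est))
```

Bands whose level is at or below the noise estimate each contribute exactly 1.0, so a frame of pure background noise yields a value close to the number of bands.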



VAD decision is made by comparing the variable snr_sum to a threshold. The threshold (vad_thr) is adapted 
to get desired sensitivity depending on estimated speech and background noise levels. 

Average background noise level is calculated by adding noise estimates at each band except the lowest 
band: 

$$\mathrm{noise\_level} = \sum_{n=2}^{12} \mathrm{bckr\_est}[n] \tag{7}$$

If the SNR is lower than the threshold (MIN_SPEECH_SNR), the speech level is increased as follows: 

if (speech_level/noise_level < MIN_SPEECH_SNR) 
    speech_level = MIN_SPEECH_SNR * noise_level 

The logarithmic value of the noise estimate is calculated as follows: 

$$\mathrm{ilog2\_noise\_level} = \log_2(\mathrm{noise\_level}) \tag{8}$$

Before the logarithmic value of the speech estimate is calculated, MIN_SPEECH_SNR*noise_level is 
subtracted from the speech level to correct its value in low SNR situations. 

$$\mathrm{ilog2\_speech\_level} = \log_2(\mathrm{speech\_level} - \mathrm{MIN\_SPEECH\_SNR} \cdot \mathrm{noise\_level}) \tag{9}$$

Threshold for VAD decision is calculated as follows: 

vad_thr = NO_SLOPE * (ilog2_noise_level - NO_P1) + THR_HIGH + MIN(SP_CH_MAX, 
MAX(SP_CH_MIN, SP_CH_MIN + SP_SLOPE * (ilog2_speech_level - SP_P1))), (10) 

where NO_SLOPE, SP_SLOPE, NO_P1, SP_P1, THR_HIGH, SP_CH_MAX and SP_CH_MIN are 
constants. 
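
The threshold adaptation of equations (8) to (10) can be sketched in floating point as shown below. All constant values are arbitrary placeholders chosen so that the arithmetic is easy to follow; they are not the values used by the codec. Flooring the argument of the speech logarithm at 1.0 is an assumption of this sketch, made because the corrected speech level can be zero.

```python
import math

# Placeholder constants for illustration only (not the codec's values).
MIN_SPEECH_SNR = 2.0
NO_SLOPE, NO_P1 = -1.0, 5.0
THR_HIGH = 20.0
SP_CH_MIN, SP_CH_MAX = -1.0, 3.0
SP_SLOPE, SP_P1 = 0.5, 8.0


def vad_threshold(noise_level, speech_level):
    """Return vad_thr according to equations (8), (9) and (10)."""
    # Raise the speech level when the estimated SNR is too low.
    if speech_level / noise_level < MIN_SPEECH_SNR:
        speech_level = MIN_SPEECH_SNR * noise_level
    ilog2_noise_level = math.log2(noise_level)                  # (8)
    corrected = speech_level - MIN_SPEECH_SNR * noise_level
    ilog2_speech_level = math.log2(max(corrected, 1.0))         # (9), floored (assumption)
    speech_term = min(
        SP_CH_MAX,
        max(SP_CH_MIN, SP_CH_MIN + SP_SLOPE * (ilog2_speech_level - SP_P1)),
    )
    return NO_SLOPE * (ilog2_noise_level - NO_P1) + THR_HIGH + speech_term  # (10)
```

With these placeholders the speech-dependent term is confined to the interval from SP_CH_MIN to SP_CH_MAX, so the speech level can move the threshold by at most 4.0, while the noise-dependent term is unbounded.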

The variable vadreg indicates intermediate VAD decision and it is calculated as follows: 

if (snr_sum > vad_thr) 
    vadreg = 1 
else 
    vadreg = 0 

3.3.3.1 Hangover addition 

Before the final VAD flag is given, a hangover is added. The hangover addition helps to detect low power 
endings of speech bursts, which are subjectively important but difficult to detect. 

The VAD flag is set to '1' if fewer than hang_len frames with a '0' decision have elapsed since burst_len 
consecutive '1' decisions were detected. The variables hang_len and burst_len are computed using 
vad_thr as follows: 

hang_len = MAX(HANG_LOW, (HANG_SLOPE * (vad_thr - HANG_P1) + HANG_HIGH)) (11) 

burst_len = BURST_SLOPE * (vad_thr - BURST_P1) + BURST_HIGH (12) 

The power of the input frame is compared to a threshold (VAD_POW_LOW). If the power is lower, the VAD 
flag is set to '0' and no hangover is added. The VAD_flag is calculated as follows: 

VAD_flag = 0 
if (pow_sum < VAD_POW_LOW) 
    burst_count = 0 
    hang_count = 0 
else 
    if (vadreg = 1) 
        burst_count = burst_count + 1 
        if (burst_count >= burst_len) 
            hang_count = hang_len 
        VAD_flag = 1 
    else 
        burst_count = 0 
        if (hang_count > 0) 
            hang_count = hang_count - 1 
            VAD_flag = 1 
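
The hangover logic can be exercised with the following Python sketch. The state class, the function names and every constant value are hypothetical placeholders introduced for illustration; only the control flow follows the pseudocode above.

```python
from dataclasses import dataclass

# Placeholder constants for illustration only (not the codec's values).
HANG_HIGH, HANG_LOW, HANG_P1, HANG_SLOPE = 12.0, 2.0, 20.0, -0.5
BURST_HIGH, BURST_P1, BURST_SLOPE = 8.0, 20.0, -0.25
VAD_POW_LOW = 1000.0


@dataclass
class HangoverState:
    burst_count: int = 0
    hang_count: float = 0


def hangover_lengths(vad_thr):
    """Equations (11) and (12): return (hang_len, burst_len)."""
    hang_len = max(HANG_LOW, HANG_SLOPE * (vad_thr - HANG_P1) + HANG_HIGH)
    burst_len = BURST_SLOPE * (vad_thr - BURST_P1) + BURST_HIGH
    return hang_len, burst_len


def hangover_addition(state, vadreg, pow_sum, hang_len, burst_len):
    """Return VAD_flag for one frame and update the counters in state."""
    vad_flag = 0
    if pow_sum < VAD_POW_LOW:
        # Very low power: no speech and no hangover.
        state.burst_count = 0
        state.hang_count = 0
    elif vadreg == 1:
        state.burst_count += 1
        if state.burst_count >= burst_len:
            state.hang_count = hang_len  # burst long enough: arm the hangover
        vad_flag = 1
    else:
        state.burst_count = 0
        if state.hang_count > 0:
            state.hang_count -= 1        # spend one hangover frame
            vad_flag = 1
    return vad_flag
```

With burst_len = 3 and hang_len = 2, three consecutive '1' decisions followed by silence keep the flag set for two further frames, whereas a single isolated '1' decision gets no hangover.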

3.3.3.2 Background noise estimation 

The background noise estimate (bckr_est[n]) is updated using the amplitude levels of the previous frame. 
The update is thus delayed by one frame, which prevents undetected starts of speech bursts from corrupting 
the noise estimate. The update speed for the current frame is selected using the intermediate VAD decisions 
(vadreg) and the stationarity counter (stat_count) as follows: 

if (vadreg for the last 4 frames has been zero) 
    alpha_up = ALPHA_UP1 
    alpha_down = ALPHA_DOWN1 
else if (stat_count = 0) 
    alpha_up = ALPHA_UP2 
    alpha_down = ALPHA_DOWN2 
else 
    alpha_up = 0 
    alpha_down = ALPHA3 

The variable stat_count indicates stationarity and its purpose is explained later in this subclause. The 
variables alpha_up and alpha_down define the update speed for upwards and downwards, respectively. The 
update speed for each band "n" is selected as follows: 

if (bckr_est_m[n] < level_{m-1}[n]) 
    alpha[n] = alpha_up 
else 
    alpha[n] = alpha_down 

Finally, the noise estimate is updated as follows: 

$$\mathrm{bckr\_est}_{m+1}[n] = (1.0 - \mathrm{alpha}[n]) \cdot \mathrm{bckr\_est}_m[n] + \mathrm{alpha}[n] \cdot \mathrm{level}_{m-1}[n] \tag{13}$$

where: 

n index of the frequency band 

m index of the frame 

The level of the background estimate (bckr_est[n]) is limited between the constants NOISE_MIN and NOISE_MAX. 
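
The update rule of equation (13) can be sketched as follows. The function names are hypothetical and all constant values are placeholders chosen for readability, not the values used by the codec; the recent vadreg values are assumed to be available as a list with the newest decision last.

```python
# Placeholder constants for illustration only (not the codec's values).
ALPHA_UP1, ALPHA_DOWN1 = 0.5, 0.25
ALPHA_UP2, ALPHA_DOWN2 = 0.1, 0.2
ALPHA3 = 0.05
NOISE_MIN, NOISE_MAX = 40.0, 20000.0


def select_update_speed(last_vadregs, stat_count):
    """Return (alpha_up, alpha_down) for the current frame."""
    recent = last_vadregs[-4:]
    if len(recent) == 4 and all(v == 0 for v in recent):
        return ALPHA_UP1, ALPHA_DOWN1
    if stat_count == 0:
        return ALPHA_UP2, ALPHA_DOWN2
    return 0.0, ALPHA3  # speech and not stationary: no upward update


def update_noise_estimate(bckr_est, prev_level, alpha_up, alpha_down):
    """Equation (13) for every band, followed by the NOISE_MIN/NOISE_MAX limits."""
    updated = []
    for est, lev in zip(bckr_est, prev_level):
        alpha = alpha_up if est < lev else alpha_down
        est = (1.0 - alpha) * est + alpha * lev
        updated.append(min(NOISE_MAX, max(NOISE_MIN, est)))
    return updated
```

Note that prev_level holds the band levels of the previous frame, which is how the one-frame delay of the update is expressed in this sketch.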

If the level of background noise increases suddenly, vadreg will be set to '1' and the background noise is 
not normally updated upwards. To recover from this situation, update of the background noise estimate is 
enabled if the intermediate VAD decision (vadreg) is '1' for a long enough time and the spectrum is stationary. 
Stationarity (stat_rat) is estimated using the following equation: 

$$\mathrm{stat\_rat} = \sum_{n=1}^{12} \frac{\mathrm{MAX}(\mathrm{STAT\_THR\_LEVEL}, \mathrm{MAX}(\mathrm{ave\_level}_m[n], \mathrm{level}_m[n]))}{\mathrm{MAX}(\mathrm{STAT\_THR\_LEVEL}, \mathrm{MIN}(\mathrm{ave\_level}_m[n], \mathrm{level}_m[n]))} \tag{14}$$

where: 

STAT_THR_LEVEL a constant 

n index of the frequency band 

m index of the frame 

ave_level average level of the input signal 

If the stationarity estimate (stat_rat) is higher than a threshold (STAT_THR), the signal is regarded as 
non-stationary and the stationarity counter (stat_count) is set to the initial value defined by the constant 
STAT_COUNT. If the signal is stationary but speech has been detected (the intermediate VAD decision is '1'), 
stat_count is decreased by one in each frame until it is zero. 

if (5 last tone flags have been one) 
    stat_count = STAT_COUNT 
else 
    if (8 last internal VAD decisions have been zero) OR (stat_rat > STAT_THR) 
        stat_count = STAT_COUNT 
    else 
        if (vadreg) AND (stat_count ≠ 0) 
            stat_count = stat_count - 1 
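
The stationarity measure and the counter update can be sketched as follows. The constant values are placeholders for illustration only, the function names are hypothetical, and the recent tone flags and vadreg values are assumed to be passed in as lists with the newest entry last.

```python
# Placeholder constants for illustration only (not the codec's values).
STAT_COUNT = 20
STAT_THR = 14.0
STAT_THR_LEVEL = 184.0


def stationarity_ratio(ave_level, level):
    """Equation (14): per-band ratio of the larger to the smaller level, summed."""
    total = 0.0
    for ave, lev in zip(ave_level, level):
        num = max(STAT_THR_LEVEL, max(ave, lev))
        den = max(STAT_THR_LEVEL, min(ave, lev))
        total += num / den
    return total


def update_stat_count(stat_count, stat_rat, vadreg, last_tone_flags, last_vadregs):
    """Return the new value of stat_count for the current frame."""
    tones = last_tone_flags[-5:]
    if len(tones) == 5 and all(t == 1 for t in tones):
        return STAT_COUNT
    decisions = last_vadregs[-8:]
    all_zero = len(decisions) == 8 and all(v == 0 for v in decisions)
    if all_zero or stat_rat > STAT_THR:
        return STAT_COUNT
    if vadreg and stat_count != 0:
        return stat_count - 1
    return stat_count
```

Each band contributes exactly 1.0 to stat_rat when its current level equals its average level, so for 12 bands the measure is 12.0 for a perfectly stationary input and grows as the spectrum changes.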

The average signal levels (ave_level[n]) are calculated as follows: 

$$\mathrm{ave\_level}_{m+1}[n] = (1.0 - \mathrm{alpha}) \cdot \mathrm{ave\_level}_m[n] + \mathrm{alpha} \cdot \mathrm{level}_m[n] \tag{15}$$

The update speed (alpha) for the previous equation is selected as follows: 

if (stat_count = STAT_COUNT) 
    alpha = 1.0 
else if (vadreg = 1) 
    alpha = ALPHA5 
else 
    alpha = ALPHA4 
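
A small sketch of equation (15) and its update speed selection follows. The values of ALPHA4, ALPHA5 and STAT_COUNT are placeholders for illustration and the function names are hypothetical.

```python
# Placeholder constants for illustration only (not the codec's values).
STAT_COUNT = 20
ALPHA4 = 0.1
ALPHA5 = 0.5


def select_ave_alpha(stat_count, vadreg):
    """Return the update speed used in equation (15)."""
    if stat_count == STAT_COUNT:
        return 1.0  # counter was just reset: restart the average from the current level
    if vadreg == 1:
        return ALPHA5
    return ALPHA4


def update_ave_level(ave_level, level, alpha):
    """Equation (15) applied to every band."""
    return [(1.0 - alpha) * ave + alpha * lev for ave, lev in zip(ave_level, level)]
```

With alpha = 1.0 the average is simply replaced by the current level.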

3.3.3.3 Speech level estimation 

First, the full-band input level is calculated by summing the input levels in each band except the lowest 
band as follows: 

$$\mathrm{in\_level} = \sum_{n=2}^{12} \mathrm{level}[n] \tag{16}$$

A frame is assumed to contain speech if its level is high enough (MIN_SPEECH_LEVEL1), and the 
intermediate VAD flag (vadreg) is set or the input level is higher than the current speech level estimate. 
The maximum level (sp_max) over SP_EST_COUNT frames is searched. If SP_ACTIVITY_COUNT speech 
frames are found within SP_EST_COUNT frames, the speech level estimate is updated using the maximum 
signal level (sp_max). The pseudocode for the speech level estimation is as follows: 

if (SP_ACTIVITY_COUNT > SP_EST_COUNT - sp_est_cnt + sp_max_cnt) 
    sp_est_cnt = 0 
    sp_max_cnt = 0 
    sp_max = 0 
sp_est_cnt = sp_est_cnt + 1 

if (in_level > MIN_SPEECH_LEVEL1) AND ((vadreg = 1) OR (in_level > speech_level)) 
    sp_max_cnt = sp_max_cnt + 1 
    sp_max = MAX(sp_max, in_level) 
    if (sp_max_cnt > SP_ACTIVITY_COUNT) 
        if (sp_max > MIN_SPEECH_LEVEL2) 
            if (sp_max > speech_level) 
                speech_level = speech_level + SP_ALPHA_UP * (sp_max - speech_level) 
            else 
                speech_level = speech_level + SP_ALPHA_DOWN * (sp_max - speech_level) 
        sp_max_cnt = 0 
        sp_max = 0 
        sp_est_cnt = 0 
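
The speech level estimation can be traced with the following Python sketch. The state class and function name are hypothetical and all constant values are small placeholders chosen so that the behaviour is easy to follow by hand; they are not the values used by the codec.

```python
from dataclasses import dataclass

# Placeholder constants for illustration only (not the codec's values).
SP_EST_COUNT = 8
SP_ACTIVITY_COUNT = 2
MIN_SPEECH_LEVEL1 = 100.0
MIN_SPEECH_LEVEL2 = 200.0
SP_ALPHA_UP = 0.5
SP_ALPHA_DOWN = 0.25


@dataclass
class SpeechEstState:
    speech_level: float
    sp_est_cnt: int = 0
    sp_max_cnt: int = 0
    sp_max: float = 0.0


def estimate_speech_level(st, in_level, vadreg):
    """Process one frame; updates st in place and returns st.speech_level."""
    # Restart the search window if it can no longer collect enough speech frames.
    if SP_ACTIVITY_COUNT > SP_EST_COUNT - st.sp_est_cnt + st.sp_max_cnt:
        st.sp_est_cnt = 0
        st.sp_max_cnt = 0
        st.sp_max = 0.0
    st.sp_est_cnt += 1

    if in_level > MIN_SPEECH_LEVEL1 and (vadreg == 1 or in_level > st.speech_level):
        st.sp_max_cnt += 1
        st.sp_max = max(st.sp_max, in_level)
        if st.sp_max_cnt > SP_ACTIVITY_COUNT:
            if st.sp_max > MIN_SPEECH_LEVEL2:
                alpha = SP_ALPHA_UP if st.sp_max > st.speech_level else SP_ALPHA_DOWN
                st.speech_level += alpha * (st.sp_max - st.speech_level)
            st.sp_max_cnt = 0
            st.sp_max = 0.0
            st.sp_est_cnt = 0
    return st.speech_level
```

With these placeholders, three speech frames with levels 2000, 3000 and 2500 move an initial estimate of 1000 halfway towards the maximum of 3000, giving 2000, and then reset the search window.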

4 Computational details 

A low-level description has been prepared in the form of ANSI C code [1]. 